DATA CLEANING

EDA surfaced a few issues, including some undocumented category labels in the data. Let's clean those up and then visualize the results.
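A minimal pandas sketch of the cleanup, assuming (as in the UCI credit-default data this resembles) that EDUCATION documents codes 1-4 but also contains 0, 5, and 6, and MARRIAGE documents 1-3 but also contains 0; the column names and code mappings are assumptions, not confirmed by the notebook:

```python
import pandas as pd

def clean_undocumented_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Fold undocumented category codes into the documented 'others' buckets."""
    out = df.copy()
    # EDUCATION: 1-4 are documented; map stray 0, 5, 6 to 4 ("others")
    out["EDUCATION"] = out["EDUCATION"].replace({0: 4, 5: 4, 6: 4})
    # MARRIAGE: 1-3 are documented; map stray 0 to 3 ("others")
    out["MARRIAGE"] = out["MARRIAGE"].replace({0: 3})
    return out
```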

BASELINE MODEL

This is our baseline score; any model we try has to beat it to be worth keeping. Baseline accuracy: 0.77
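A common baseline for an imbalanced binary target is to always predict the majority class; a 0.77 baseline accuracy is consistent with roughly 77% of samples belonging to one class. A small sketch of that calculation (the notebook's actual baseline method is not shown, so this is an assumption):

```python
import numpy as np

def majority_baseline_accuracy(y: np.ndarray) -> float:
    """Accuracy of a classifier that always predicts the most frequent class."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / len(y)
```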

FEATURE ENGINEERING

EDA shows that the SEX, MARRIAGE, and EDUCATION columns hold categorical data, so let's convert them to object dtype.
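A short sketch of that conversion, so downstream steps treat the integer codes as categories rather than as numbers with an ordering:

```python
import pandas as pd

def cast_categoricals(df: pd.DataFrame,
                      cols=("SEX", "MARRIAGE", "EDUCATION")) -> pd.DataFrame:
    """Cast integer-coded categorical columns to object dtype."""
    out = df.copy()
    for col in cols:
        out[col] = out[col].astype(object)
    return out
```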

Data Splitting

EDA also shows that the data is imbalanced. Let's look at the class distribution in the train dataset created above.
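A sketch of the split plus the distribution check, assuming a feature frame `X` and a binary target series `y` (names are placeholders; the notebook's split parameters are not shown). Stratifying keeps the class ratio identical in train and test:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_and_report(X, y, test_size=0.25, seed=42):
    """Stratified train/test split plus the training-set class distribution."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    # Class distribution in the training set, as fractions
    dist = y_train.value_counts(normalize=True)
    return X_train, X_test, y_train, y_test, dist
```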

Now let's resample the training data accordingly.

A. RANDOM OVERSAMPLING:

This randomly duplicates cases from the minority class (sampling with replacement) and adds them to the dataset until the classes are balanced.
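In practice this is usually done with imbalanced-learn's `RandomOverSampler`; the pandas sketch below illustrates the same idea (the `target` column name is a placeholder):

```python
import pandas as pd

def random_oversample(df: pd.DataFrame, target: str, seed: int = 42) -> pd.DataFrame:
    """Duplicate random minority-class rows (with replacement) until every
    class has as many rows as the majority class."""
    counts = df[target].value_counts()
    majority_n = counts.max()
    parts = []
    for cls, n in counts.items():
        cls_rows = df[df[target] == cls]
        if n < majority_n:
            cls_rows = cls_rows.sample(majority_n, replace=True, random_state=seed)
        parts.append(cls_rows)
    # Shuffle so duplicated minority rows are not grouped together
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)
```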

B. RANDOM UNDERSAMPLING

This randomly removes cases from the majority class until the desired level of balance is reached.
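The mirror image of the oversampling sketch, again a stand-in for imbalanced-learn's `RandomUnderSampler` (the `target` column name is a placeholder):

```python
import pandas as pd

def random_undersample(df: pd.DataFrame, target: str, seed: int = 42) -> pd.DataFrame:
    """Drop random majority-class rows until every class is as small as the
    minority class."""
    counts = df[target].value_counts()
    minority_n = counts.min()
    parts = [
        df[df[target] == cls].sample(minority_n, random_state=seed)
        for cls in counts.index
    ]
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)
```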

C. SMOTE: Synthetic Minority Oversampling Technique

Why SMOTE: random oversampling increases the likelihood of overfitting (the model sees exact duplicates of minority rows), while undersampling shrinks the dataset and can hurt accuracy by discarding potentially useful data.

How does SMOTE work? It draws a line segment between a minority-class example and one of its nearest minority-class neighbours, then creates a new synthetic sample at a random point along that segment.
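In practice one would use `imblearn.over_sampling.SMOTE`; the NumPy sketch below just illustrates the interpolation idea on the minority-class matrix (function name and parameters are illustrative):

```python
import numpy as np

def smote_sketch(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 42) -> np.ndarray:
    """Generate n_new synthetic minority samples: pick a random minority point,
    pick one of its k nearest minority neighbours, and interpolate a new point
    at a random position on the line segment between them."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]    # indices of the k nearest neighbours
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(n)              # random minority point
        b = nn[a, rng.integers(k)]       # one of its nearest neighbours
        lam = rng.random()               # position along the segment
        synthetic[i] = X_min[a] + lam * (X_min[b] - X_min[a])
    return synthetic
```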

So we now have four training datasets at different balance levels:

  1. Unbalanced data
  2. Randomly undersampled
  3. Randomly oversampled
  4. SMOTE

Evaluation Criteria: K-fold cross-validation scored with ROC-AUC
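A sketch of that evaluation loop with scikit-learn, assuming a 5-fold stratified split and the `roc_auc` scorer (the notebook's exact fold count and seed are not shown):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score

def kfold_auc(model, X, y, n_splits=5, seed=42):
    """Mean ROC-AUC across stratified K folds."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    return scores.mean()
```

The same function can then be run against each of the four datasets to compare balancing strategies on equal footing.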

LOGISTIC REGRESSION

The best AUC for logistic regression comes from the undersampled data, with a test AUC of 0.65 and a K-fold score of 0.615.

NAIVE BAYES

The best score for Naive Bayes comes from the undersampled data after standardization, with a test AUC of 0.74 and a train K-fold score of 0.70.
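Putting the standardization inside a pipeline ensures the scaler is fitted only on the training folds during cross-validation, avoiding leakage. A sketch, assuming the Gaussian variant of Naive Bayes (the notebook does not say which one it used):

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scaler + classifier as one estimator: fit() standardizes, then trains NB
nb_model = make_pipeline(StandardScaler(), GaussianNB())
```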

K-Nearest Neighbour

The best score for KNN comes from the undersampled data, with a test AUC of 0.74 and a train K-fold score of 0.68.

Decision Tree

Random Forest

ADA Booster

Gradient Boosting Classifier

XG Boost Classifier